Abstract

In the past two decades, numerous scheduling and load balancing
techniques have been proposed for locally distributed multiprocessor
systems. However, they all suffer from significant deficiencies when
extended to a Grid environment; some use a centralized approach that
renders the algorithm unscalable, while others assume the overhead
involved in searching for appropriate resources to be negligible.
Furthermore, classical scheduling algorithms do not consider a Grid
node to be N-resource rich and merely work towards maximizing the
utilization of one of the resources. In this paper, we propose a new
scheduling and load balancing algorithm for a generalized Grid model of
N-resource nodes that not only takes into account the node and network
heterogeneity, but also considers the overhead involved in coordinating
among the nodes. Our algorithm is de-centralized, scalable, and
overlaps the node coordination time with that of the actual processing
of ready jobs, thus saving valuable clock cycles needed for making
decisions. The proposed algorithm is studied by conducting simulations
using the Message Passing Interface (MPI) paradigm.


1. Introduction 

Computational Grids [1, 6] are typically a conglomera- 
tion of various resources with different owners, but make it 
possible for users to develop complex applications that ac- 
cess remote sites. Each of these sites (or nodes) could be a 
uni-processor machine, a symmetric multiprocessor cluster, 
a distributed memory multiprocessor system, or a massively 
parallel supercomputer. Each node consists of a number
of heterogeneous resources, the heterogeneity being in the
type and capability of each of its N-resources (e.g., number
of processors, CPU speed, amount of memory, and so on).
Perhaps the biggest advantage of a heterogeneous Grid
environment over an isolated multiprocessor system is that it
can offer resources to the user that are not locally available.

With the Grid becoming a viable high performance computing
alternative to the traditional supercomputing environment, various
aspects of effective Grid resource utilization are gaining
significance. Given the Grid's multitude of resources, proper
scheduling and efficient load balancing across the Grid can lead to
improved overall system performance and a lower turn-around time for
individual jobs. Classical load balancing algorithms [3, 5, 14, 20]
address this problem by maximizing the utilization of a single
resource (generally, the CPU). However, this approach loses its merit
for systems like the SUN Enterprise, the SGI Origin, and the IBM
Regatta that offer multiple resources like shared memory, large disk
farms, distinct I/O channels, and software licenses that can be
independently allocated to different jobs.

Another area where classical and even recent N-resource load
balancing approaches show their deficiency is in scalability: not
many of them [10, 11, 12, 13, 14, 15, 18] can be scaled to the large
number of processors in a Grid. This drawback is due either to the
centralized approach of the algorithm [13, 18] or to the need for each
node to have global system knowledge [11]. Also, most algorithms [10]
either do not consider the overhead of searching for appropriate nodes
or consider it to be negligible. This assumption is valid for
tightly-coupled multiprocessor systems [15, 17, 19], but not for
geographically distributed environments like the Grid.

The present work is targeted to the Grid model where
each node is assumed to be an N-resource server and any job
submitted to the Grid can be executed at any node. The only
information our proposed algorithm needs before a node
schedules a job is the communication latency between itself
and its neighbors, thus making it fully scalable, an important
consideration for a wide-area network like NASA's
Information Power Grid (IPG) [2, 7]. The overhead involved
in capturing the resource utilization status of a given node’s 
neighbors before making a scheduling decision can be a ma- 
jor issue negating the advantages of job migration. Our al- 
gorithm therefore overlaps the time spent looking for appro- 
priate nodes with the actual execution of the ready jobs, thus 
saving precious clock cycles. Also, since each Grid node 
(whether a single uni-processor machine or a multiproces- 
sor system) can have its own independent scheduling algo- 
rithm, our technique does not overrule the local schedulers’ 
job assignment policy. The class of problems we address 
is where jobs are computation-intensive and can be divided 
into totally independent sub-tasks with no communication 
between them. 

We have conducted extensive experiments using the Message Passing
Interface (MPI) paradigm and by simulating the job arrival rate. We
compared the quality of load balance with the ideal case (where no
overheads are involved) and found that our algorithm performs well in
a heterogeneous Grid environment, reaching a Normalized Performance of
0.79 for 3-resource nodes and 0.85 for 1-resource nodes. The remainder
of this paper is organized as follows: Section 2 describes our
algorithm and presents pseudocode for the key procedures; Section 3
discusses the experimental setup that we used to test and substantiate
our claims, and interprets the results; and Section 4 concludes the
paper.

2. Scheduling and load balancing 

Two important aspects of any wide area network sched- 
uler are its transfer [4, 15] and location [8, 9] policies. The 
transfer policy decides if there is a need to initiate load bal- 
ancing across the system, and is typically threshold based. 
Using workload information, it determines when a node be- 
comes eligible to act as a sender (transfer a job to another 
node) or as a receiver (retrieve a job from another node). 
The location policy selects a partner node for a job transfer 
transaction. In other words, it locates complementary nodes
to/from which a node can send/receive workload to improve
overall system performance.

Location policies can be broadly classified as 
sender-initiated [4, 21], receiver-initiated [4, 12], or 
symmetrically-initiated [5, 15, 19]. Sender-initiated
policies are those where heavily-loaded nodes search 
for lightly-loaded nodes while receiver-initiated policies 
are those where lightly-loaded nodes search for suitable 
senders. Symmetrically-initiated policies combine the 
advantages of these two by requiring both senders and 
receivers to look for appropriate partners. 

Load balancing policies can also be classified on the ba- 
sis of how up-to-date each node’s knowledge is about the 
state of the system. Dynamic [16, 17] policies make decisions
based on the current system state and can rapidly
adapt to workload fluctuations. On the other hand, policies
that use static information and are not amenable to changes 
in the workload are known as static [3] policies. How- 
ever, dynamic policies incur the overhead of communicat- 
ing among the system nodes to keep them informed about 
the state of the system. 

In this section, we describe our scheduling and load balancing
algorithm for N-resource Grid environments. It is dynamic,
sender-initiated, and completely de-centralized. The last feature
makes it extremely scalable for Grid environments. A notable property
of our algorithm is its search strategy for finding partner nodes: a
node queries only those nodes whose replies can arrive before its own
queue of ready jobs is emptied. It also overlaps this decision making
process of a node with the actual execution of ready jobs, thereby
saving precious processor cycles.

2.1. Preliminaries 

Before discussing the algorithm, let us introduce the concepts of
Internal and External queues, which we assume exist in each Grid node.
The Internal Queue of a node consists of the ready jobs which would be
executed by this particular node only. Let $\tau$ be the time when the
tasks were last mapped, $a(t_j)$ be the arrival time of task $t_j$, and
$e(t_j)$ be the time $t_j$ starts executing. Then, the jobs in the
Internal Queue are those that have been mapped and scheduled to this
node, and are either being executed (Eq. 1) or are ready to be executed
(Eq. 2); they would never be delegated to any other node:

$$\{ t_j \mid a(t_j) < \tau,\ e(t_j) < \tau \} \qquad (1)$$

$$\{ t_j \mid a(t_j) < \tau,\ e(t_j) > \tau \} \qquad (2)$$

In contrast, the External Queue of a node consists of jobs which
have been initially submitted to this node by a user, but are
yet to be mapped and scheduled for execution (Eq. 3):

$$\{ t_j \mid a(t_j) > \tau,\ e(t_j) > \tau \} \qquad (3)$$
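The three sets in Eqs. 1-3 can be read as a simple classification rule. The sketch below is only an illustration of that reading, not code from the paper: the `Task` record, the `classify` function, and the label strings are hypothetical names, and a task that has not started yet is assumed to carry an infinite start time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    arrival: float  # a(t_j)
    start: float    # e(t_j); float("inf") is assumed for "not started yet"


def classify(task: Task, tau: float) -> str:
    """Place a task relative to the last mapping time tau (Eqs. 1-3)."""
    if task.arrival < tau and task.start < tau:
        return "internal-executing"  # Eq. 1: mapped and already running
    if task.arrival < tau and task.start > tau:
        return "internal-ready"      # Eq. 2: mapped, waiting to run
    if task.arrival > tau and task.start > tau:
        return "external"            # Eq. 3: submitted, not yet mapped
    # The strict inequalities in Eqs. 1-3 leave equality cases unspecified.
    return "unspecified"
```

With `tau = 10.0`, a task that arrived at 4.0 and starts at 12.0 belongs to the Internal Queue as a ready job, while a task arriving at 11.0 sits in the External Queue.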

Let us now enumerate the key notations we will be using 
throughout the paper to explain our algorithm: 

• $P_i$: Grid node $i$

• $P_i^j$: The $j$-th resource of $P_i$

• $J_k$: Job $k$

• $J_k^j$: Ideal requirement for the $j$-th resource by $J_k$

• $Neigh(P_i)$: Immediate neighbors of $P_i$

• $Comp_i(t)$: Time needed by $P_i$ to empty its Internal
Queue assuming no more jobs are assigned to it after time $t$

• $Comm_i^j$: Communication latency between $P_i$ and $P_j$

• $ExQ_i$: Number of jobs in the External Queue of $P_i$

We assume that each Grid node has knowledge about the
communication latency between itself and all of its neighbors;
i.e., each node $P_i$ knows $Comm_i^j\ \forall j \in Neigh(P_i)$.
Not only does this make the algorithm highly scalable, it
also allows the network to conveniently accommodate any
changes in its topology.

We also postulate that each incoming job knows its requirements
for each of the resources available at a node. In order to generalize
this concept, we define N-resource jobs and N-resource nodes/servers.
Each job $J_k$ looks for a node $P_i$ with resources
$P_i^0, P_i^1, \ldots, P_i^{N-1}$, such that it meets its requirement
for each resource type, $J_k^0, J_k^1, \ldots, J_k^{N-1}$.

The algorithm described below executes on every node of the Grid.

2.2. Proposed algorithm 


Whenever a job is submitted by a user to a node $P_i$, procedure
Main (Fig. 1) invokes procedure NeedForTriggering (Fig. 2) to decide
whether the job needs to be migrated. If the job ought to be migrated
to another node, a request is sent to all nodes $j \in Neigh(P_i)$,
provided $2 \times Comm_i^j < Time_{IQ}$. This condition implies that
the status request to the neighboring nodes and their responses
should be received before the Internal Queue is emptied (the time
needed to empty it is denoted by $Time_{IQ}$). This strategy avoids
any wastage of the node's resources; the inequality overlaps the task
of looking for appropriate nodes with the actual processing of the
Internal Queue, thus hiding the overhead.


Procedure Main
    Repeat forever
        If (a new job submitted)
            Time ← Current System Time (CST)
            NeedForTriggering (σ, Time)
            If (NeedForTriggering returns TRUE)
                Time_IQ ← Time to empty Internal Queue
                ∀ j ∈ Neigh(P_i)
                    If (2 × Comm_i^j < Time_IQ)
                        Request (j, Comm_i^j, Time_IQ)
                Receive (Time_IQ)
                Balance (S, R)
            End If
        End If
    End Repeat
End Main


Figure 1. Procedure Main 
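The neighbor filter inside procedure Main can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `neighbors_to_query`, the dictionary of latencies, and the sorted return order are all hypothetical choices made for the example.

```python
from typing import Dict, List


def neighbors_to_query(comm_latency: Dict[str, float],
                       time_iq: float) -> List[str]:
    """Return neighbors whose round trip fits inside Time_IQ.

    comm_latency maps each neighbor j to the one-way latency Comm_i^j.
    A neighbor qualifies only if 2 * Comm_i^j < Time_IQ, so that its
    reply can arrive before the Internal Queue is emptied.
    """
    return sorted(j for j, latency in comm_latency.items()
                  if 2 * latency < time_iq)
```

For example, with latencies `{"B": 3.0, "C": 6.0, "D": 5.0}` and `time_iq = 10.0`, only `B` qualifies: the round trip to `C` takes 12 units, and the round trip to `D` takes exactly 10, which fails the strict inequality.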

Refer to Figures 2 and 3 for the triggering policy we have
incorporated into our algorithm. It is based on the simple
heuristic that the greater the load at a node, the less inclined
it would be to accept future loads. Within a time window
of $Comp_i(\tau)$, triggering is initiated if the traffic burst is
more than admissible; however, the higher the resource usage,
the smaller the traffic burst that a node will accommodate
(Fig. 3).


Procedure NeedForTriggering (σ, Time)
    δ ← δ + σ    /* δ is Cumulative Load */
    If (CST − Time < Comp_i(τ))
        If (δ > Admissible Load at τ)
            τ ← CST
            Return TRUE
    Else
        Commit δ to Internal Queue
        τ ← CST
        δ ← 0
        Return FALSE
    End If
End NeedForTriggering


Figure 2. Procedure NeedForTriggering 



Figure 3. Value of δ when Job Queue is x% full
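The threshold test of the triggering policy can be sketched as follows. The linear shape of the admissible load is an assumption taken from the description in Section 3.2 of a line joining (0, Max) and (100, 0); the function names and the percentage-based queue measure are hypothetical, and the time-window bookkeeping of Figure 2 is deliberately left out.

```python
def admissible_load(queue_fill_pct: float, max_load: float) -> float:
    """Admissible traffic burst when the job queue is queue_fill_pct % full.

    Assumed linear: max_load at 0% full, falling to 0 at 100% full.
    """
    if not 0.0 <= queue_fill_pct <= 100.0:
        raise ValueError("queue_fill_pct must lie in [0, 100]")
    return max_load * (1.0 - queue_fill_pct / 100.0)


def should_trigger(cumulative_load: float, queue_fill_pct: float,
                   max_load: float) -> bool:
    """True when the cumulative load exceeds what the node will accept."""
    return cumulative_load > admissible_load(queue_fill_pct, max_load)
```

With `max_load = 20` and a half-full queue, the node accepts a burst of up to 10 units; a cumulative load of 12 would therefore start load balancing, while a load of 8 would not.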


A node, having received a request to send the status of its
resources, packs the information about their current utilization
and sends it back to the requesting node along the route
the request came (Fig. 4). This route is also piggybacked
to the node which needs to migrate load. Besides replying
to requests, a node also recursively pings its neighbors for
their resource status if its database says that the total
round-trip latency between the sender and its neighbor would be
less than $Time_{IQ}$. This allows the time required to look for
additional resources to be hidden under processing.


Procedure Request (i, γ, Time_IQ)
    Create Set S
    S.Route ← Route followed to reach i
    S.ResStatus ← Current usage of {P_i^0, P_i^1, ..., P_i^{N-1}}
    MPI_Send (S to i)
    ∀ j ∈ Neigh(P_i)
        If (2 × (γ + Comm_i^j) < Time_IQ)
            Request (j, γ + Comm_i^j, Time_IQ)
        End If
End Request


Figure 4. Procedure Request 
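To see how far the recursive status requests spread, the propagation rule can be simulated on a small latency graph. The sketch below is an illustration only: it replaces message passing with plain recursion, and it adds one assumption that the pseudocode does not state, namely that a node is not explored again when it was already reached over a route that is no slower. The names `Graph` and `reachable_nodes` are hypothetical.

```python
from typing import Dict

# node -> {neighbor: one-way latency}; a hypothetical representation
Graph = Dict[str, Dict[str, float]]


def reachable_nodes(graph: Graph, sender: str,
                    time_iq: float) -> Dict[str, float]:
    """Map each node that would reply to its accumulated one-way latency."""
    best: Dict[str, float] = {}

    def request(node: str, gamma: float) -> None:
        # Assumption: ignore the sender itself and routes that are no faster.
        if node == sender or best.get(node, float("inf")) <= gamma:
            return
        best[node] = gamma
        for neighbor, latency in graph[node].items():
            # Forward only if the round trip to the sender still fits.
            if 2 * (gamma + latency) < time_iq:
                request(neighbor, gamma + latency)

    for neighbor, latency in graph[sender].items():
        if 2 * latency < time_iq:
            request(neighbor, latency)
    return best
```

In a chain A-B-D-E with latencies 2, 2, and 3, plus a slow link A-C of latency 6, a sender A with `time_iq = 10` hears from B (latency 2) and D (latency 4); C and E are skipped because their round trips would take 12 and 14 units.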
Figure 5 shows the pseudocode for procedure Receive.
The sender waits for time $Time_{IQ}$ to get replies from the
nodes that have been queried for the status of their resources.


Procedure Receive (Time_IQ)
    While (Time < Time_IQ)
        MPI_Receive (S)
    End While
    R ← Number of replies
    Return R
End Receive


Figure 5. Procedure Receive


Figure 6 shows our procedure to schedule the jobs soon
after $Time_{IQ}$ elapses. Without loss of generality, we can
assume that $0.0 \le P_i^j, J_k^j \le 1.0$, $0 \le j \le N-1$. Let
$M_i^k$ be a match variable which defines the number of resources
in node $P_i$ that fulfill the requirements of job $J_k$. If
$bool(J_k^j \le P_i^j)$ is 1 and $bool(J_k^j > P_i^j)$ is 0, then we
can formally define $M_i^k$ as

$$M_i^k = \sum_{j=0}^{N-1} bool(J_k^j \le P_i^j) \qquad (4)$$

Clearly, $0 \le M_i^k \le N$. Now, let us define matrices T
and C, and vector V, as described in steps 1, 2, and 3 of
procedure Balance (Fig. 6). Intuitively, the u-th row and
k-th column of T gives the number of resources in node $P_u$
that meet the requirements of job $J_k$; the k-th entry of V
gives the number of nodes which satisfy the minimum requirements
of $J_k$; and element $C_{u,j}$ denotes that there is a
common node that satisfies the requirements of both $J_u$ and
$J_j$, and that there might be a conflict while scheduling them.

Another possible scenario is when the set of nodes that
satisfy the requirements of $J_u$ is a subset of the set which
satisfies the requirements of $J_j$; in such cases, giving
preference to $J_j$ might leave $J_u$ with no viable option. To
avoid such cases, our algorithm first schedules jobs that
have the fewest choices. $T_{u,\min(V_j)}$ in step 4.1 of Fig. 6
corresponds to the job $J_{min}$ that has the minimum number
of nodes it can be mapped to. The variable z indicates
the node $P_z$ to which $J_{min}$ can be delegated. Step 4.2
checks matrix C and, in case there is another job that can be
mapped to $P_z$, chooses a different z for $J_{min}$, if possible.
Finally, $J_{min}$ is mapped and scheduled to $P_z$. This mechanism
continues until all jobs have been scheduled or until
no more can be mapped because of the lack of resources.


3. Experimental study

Here we describe the metrics used to gauge the performance
of our scheduling and load balancing algorithm, the
setup we had for our experiments, the simulation results that
were obtained, and the conclusions we can draw from them.

3.1. Performance metrics

We analyze the performance of our algorithm using a
parameter called Normalized Performance, $\eta$ (defined in
Eq. 5). Basically, $\eta$ is the effectiveness of the load balancing
strategy. It is a comprehensive metric as it considers
both the initial load balance as well as the load balancing
overheads:

$$\eta = \frac{T_{no} - T_{my}}{T_{no} - T_{lb}} \qquad (5)$$

Here, $T_{no}$ is the time to completely process all the jobs on a
uniprocessor machine; $T_{lb}$ is the time required by one processor
divided by the total number of processors, thus providing
the runtime with ideal load balancing; and $T_{my}$ is
the time needed by our algorithm to balance the load and
execute all the jobs. Clearly,

$$\text{if } T_{my} \to T_{lb}, \text{ then } \eta \to 1 \qquad (6)$$

$$\text{if } T_{my} \to T_{no}, \text{ then } \eta \to 0 \qquad (7)$$

These two conditions imply that the higher the value of $\eta$, the
better the load balancing; the ideal case being $\eta = 1$.
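As a worked example of Eq. 5, the metric can be computed directly from the uniprocessor time, the measured time, and the processor count. The function name and the numbers in the usage note are illustrative assumptions, not measurements from the experiments.

```python
def normalized_performance(t_no: float, t_my: float,
                           num_procs: int) -> float:
    """Normalized Performance (Eq. 5).

    t_no: time to process all jobs on one processor
    t_my: time taken with load balancing, overheads included
    The ideal time is the uniprocessor time divided by the processor count.
    """
    if num_procs < 2:
        raise ValueError("need at least two processors for a meaningful ratio")
    t_lb = t_no / num_procs
    return (t_no - t_my) / (t_no - t_lb)
```

With `t_no = 100` and four processors, the ideal time is 25. A measured time of 40 gives (100 - 40) / (100 - 25) = 0.8, a measured time of 25 gives 1 (Eq. 6), and a measured time of 100 gives 0 (Eq. 7).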

3.2. Experimental setup 

The experimental results reported in this paper were ob- 
tained by using an MPI implementation of our proposed al- 
gorithm. It is worth mentioning here that the various pa- 
rameters of our algorithm were varied following a Poisson 
distribution. Their respective mean values are given in Ta- 
ble 1. 


Table 1. Variables used in the experiments

Variables                        Mean    Simulated by
Processing Power Requirements    2-16    50 floating-point multiplications per unit
Memory Requirements              2-16    1 KB of memory allocated and freed per unit
I/O Requirements                 2-16    1 KB of data written to disk per unit
Network Latency                  5-11    sleep(3) per unit
Node Degree                      5       number of neighboring nodes
Procedure Balance (S, R)

1. Using S, define matrix T of dimensions ExQ_i × R where T_{u,k} ← M_u^k

2. Define vector V of dimension ExQ_i where V_k ← Σ_u bool(T_{u,k} = N), 1 ≤ k ≤ ExQ_i

3. Define matrix C of dimensions ExQ_i × ExQ_i where
   C_{l,k} ← C_{k,l} ← 1, if T_{k,j} = T_{l,j} = N; 0, otherwise; 1 ≤ l, k ≤ ExQ_i, 1 ≤ j ≤ R

4. Repeat until (no more jobs can be mapped)
   4.1 z ← u | T_{u,min(V_j)} = N, 1 ≤ u ≤ R, 1 ≤ j ≤ ExQ_i
   4.2 If (C_{min(V_j),k} = 1, 1 ≤ k ≤ ExQ_i)
           Choose another z, if possible
   4.3 Assign J_{min(V_j)} to node P_z, 1 ≤ j ≤ ExQ_i
   4.4 Remove row min(V_j), 1 ≤ j ≤ ExQ_i, and column z from T
End Balance


Figure 6. Procedure Balance
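The fewest-choices-first idea behind procedure Balance can be sketched compactly. This is a simplified illustration, not the procedure of Figure 6: it replaces the matrices T, C and vector V with sets of feasible nodes per job, it breaks ties alphabetically for determinism, and it gives each node at most one job per round. The names `match_count` and `balance` are hypothetical, and every job and node is assumed to list the same N resources in the same order.

```python
from typing import Dict, Sequence, Set


def match_count(job: Sequence[float], node: Sequence[float]) -> int:
    """Eq. 4: number of node resources that meet the job's requirement."""
    return sum(1 for need, have in zip(job, node) if need <= have)


def balance(jobs: Dict[str, Sequence[float]],
            nodes: Dict[str, Sequence[float]]) -> Dict[str, str]:
    """Assign jobs to nodes, scheduling the most constrained job first."""
    n = len(next(iter(nodes.values()))) if nodes else 0
    # feasible[k]: nodes on which all N requirements of job k are met
    feasible: Dict[str, Set[str]] = {
        k: {u for u, cap in nodes.items() if match_count(req, cap) == n}
        for k, req in jobs.items()
    }
    assignment: Dict[str, str] = {}
    while True:
        pending = {k: f for k, f in feasible.items() if f}
        if not pending:
            break  # every remaining job has no viable node
        # The job with the fewest candidate nodes goes first.
        k = min(pending, key=lambda name: (len(pending[name]), name))
        wanted_by_others = set().union(
            *(f for other, f in pending.items() if other != k))
        # Prefer a node no other pending job can use, to avoid conflicts.
        conflict_free = sorted(pending[k] - wanted_by_others)
        z = conflict_free[0] if conflict_free else sorted(pending[k])[0]
        assignment[k] = z
        del feasible[k]
        for f in feasible.values():
            f.discard(z)  # node z is taken for this round
    return assignment
```

With nodes `P1 = (0.5, 0.5)` and `P2 = (0.9, 0.9)`, a job `J2 = (0.8, 0.6)` fits only on `P2` and is therefore scheduled before `J1 = (0.4, 0.4)`, which fits on both and ends up on `P1`. Scheduling `J1` first could have taken `P2` and left `J2` with no option, which is the subset scenario described in Section 2.2.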


Experiments were conducted for three different values of
Max (15, 20, and 25) (see Fig. 3), and repeated for 1-, 2-,
and 3-resource nodes. The following three inequalities give
the relationships between the $m_i$'s, where each $m_i$ refers to
the slope of the line joining the co-ordinates (0, Max) and
(100, 0) (Fig. 3):

$$m_1, m_2, m_3 < 0 \qquad (8)$$

$$m_1 < m_2 < m_3 \qquad (9)$$

$$|m_1| > |m_2| > |m_3| \qquad (10)$$
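The three inequalities follow from simple arithmetic on the slopes. The sketch below computes them for the three values of Max; which value of Max corresponds to which $m_i$ is not stated explicitly in the text, so the ordering used here (the steepest line is $m_1$) is inferred from inequalities 9 and 10, and the function name is hypothetical.

```python
def threshold_slope(max_load: float) -> float:
    """Slope of the line through (0, max_load) and (100, 0)."""
    return (0.0 - max_load) / (100.0 - 0.0)


# Sorting in ascending order yields m1 <= m2 <= m3 (inferred labeling).
m1, m2, m3 = sorted(threshold_slope(m) for m in (15, 20, 25))
```

This gives slopes of -0.25, -0.20, and -0.15 for Max values of 25, 20, and 15 respectively: all three are negative, they increase from $m_1$ to $m_3$, and their absolute values decrease.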


3.3. Simulation results 

We have conducted extensive experiments to evaluate the 
performance of our algorithm and help us substantiate our 
approach. Figures 7 through 9 illustrate the results obtained 
from the study. 

To verify that our algorithm works well for completely
heterogeneous systems, we divided the experiments into
three groups. The first set of experiments was run on systems
where heterogeneity was in the capabilities of the
N-resources of a node; thus, the communication latency
between all neighboring nodes was constant. The second
set involved keeping the node capability constant and varying
only the communication latency between the nodes. Finally,
the third set of experiments combined the above two
approaches, thereby exposing a totally heterogeneous setup
to various load conditions (that were varied by changing the
job arrival rate and the load associated with each job). Each
set of experiments was repeated for 1-, 2-, and 3-resource
nodes. The objective was to evaluate the algorithm thoroughly
by taking various scenarios of heterogeneity into
consideration.

Results for the first set (where only the capabilities of the
N-resources of a node are varied while keeping all other
factors unchanged) are summarized by the graphs in Fig. 7.
The horizontal axis represents the Mean Node Capacity of
the network, which can be defined as the mean value used
for the capacity of each of the resources in a node (all
resources having the same mean). Increasing the resource
capability of the nodes without changing the job resource
requirements effectively reduces the granularity of the latter.
As expected, any increase in node capability increases $\eta$.
However, as the threshold slopes ($m_i$'s) become steeper, $\eta$
decreases. This is because the frequency of triggering the
load balancing algorithm is reduced.

In the second set of experiments, the Mean Node Capacity
was held constant while varying the communication latency.
The results presented in Fig. 8 show that $\eta$ decreases
with increasing communication cost. As in the previous set,
the algorithm performs best when the absolute value of the
threshold slope is the smallest ($m_3$ in this case).

For the final set of experiments, we vary the input load
for a setup which has a heterogeneous mix of resource capability
and communication latency. This was repeated for 1-,
2-, and 3-resource job specifications for a 3-resource node.
Figure 9 shows that the execution time decreases as we get
more specific about job requirements.

Figure 9. Execution Time ($T_{my}$) vs. Average Load


4. Conclusions

In this paper, we presented a highly de-centralized, distributed,
and scalable algorithm for scheduling tasks and
load balancing resources in heterogeneous Grid environments.
Our algorithm takes into consideration the overheads
of coordination and communication between the Grid
nodes, which were assumed to be N-resource servers that
varied in their respective capacities across resources. The
goal was to assign each node a job which would utilize
its resources in the best possible manner, thus providing
an effective scheduling and resource management strategy.
We introduced a new load balance triggering policy based
on the endurance of a node reflected by its current queue
length. Also, our algorithm overlaps the time needed for
various communication overheads with that of executing the
jobs already committed to the nodes, making the effective
time for overheads virtually zero. The algorithm has been
discussed in detail, with pseudocode provided for all
its major modules.

To substantiate our claims, a comprehensive experimen- 
tal study was conducted using the Message Passing Inter- 
face (MPI) paradigm. Heterogeneity in resource capabil- 
ities and communication latency was maintained while re- 
peating the set of experiments for 1-, 2-, and 3-resource jobs 
and nodes. The Normalized Performance parameter was 
0.79 for 3-resource nodes and as high as 0.85 for 1-resource
nodes. These excellent performance levels could be attained 
only by overlapping the various overheads with the actual 
execution of the jobs. 